Capacity of Clustered Distributed Storage
Abstract
A new system model reflecting the clustered structure of distributed storage is suggested to investigate the interplay between storage overhead and repair bandwidth as storage node failures occur. Large data centers with multiple racks/disks or local networks of storage devices (e.g., sensor networks) are good applications of the suggested clustered model. In realistic scenarios involving clustered storage structures, repairing storage nodes using intact nodes residing in other clusters is more bandwidth-consuming than restoring nodes based on information from intra-cluster nodes. Therefore, it is important to differentiate between intra-cluster repair bandwidth and cross-cluster repair bandwidth in modeling distributed storage. The capacity of the suggested model is obtained as a function of fundamental resources of distributed storage systems, namely, the node storage capacity, the intra-cluster repair bandwidth and the cross-cluster repair bandwidth. The capacity is shown to be asymptotically equivalent to a monotonically decreasing function of the number of clusters, as the number of storage nodes increases without bound. Based on the capacity expression, feasible sets of required resources which enable reliable storage are obtained in a closed-form solution. Specifically, it is shown that the cross-cluster traffic can be minimized to zero (i.e., intra-cluster local repair becomes possible) by allowing extra resources on storage capacity and intra-cluster repair bandwidth, according to the law specified in the closed-form solution. The network coding schemes with zero cross-cluster traffic are defined as intra-cluster repairable codes, which are shown to be a class of the previously developed locally repairable codes.
Index Terms:
Capacity, Distributed storage, Network coding
I Introduction
Many enterprises, including Google, Facebook, Amazon and Microsoft, use cloud storage systems in order to support massive amounts of data storage requests from clients. In the emerging Internet-of-Things (IoT) era, the number of devices which generate data and connect to the network increases exponentially, so that efficient management of data centers becomes a formidable challenge. However, since cloud storage systems are composed of inexpensive commodity disks, failure events occur frequently, degrading system reliability [2].
In order to ensure the reliability of cloud storage, distributed storage systems (DSSs) with erasure coding have been considered to improve tolerance against storage node failures [3, 4, 5, 6, 7, 8]. In such systems, the original file is encoded and distributed across multiple storage nodes. When a node fails, a newcomer node regenerates the failed node by contacting a number of surviving nodes. This causes traffic across the network, taking up significant repair bandwidth. Earlier distributed storage systems utilized the 3-replication code: the original file was replicated three times, and the replicas were stored in three distinct nodes. The 3-replication coded systems require the minimum repair bandwidth, but incur high storage overhead. Reed-Solomon (RS) codes are also used (e.g., HDFS-RAID in Facebook [9]), which allow minimum storage overhead; however, RS-coded systems suffer from high repair bandwidth.
The pioneering work of [10] on distributed storage systems focused on the relationship between two required resources, the storage capacity α of each node and the repair bandwidth γ, when the system aims to reliably store a file under node failure events. The optimal (α, γ) pairs are shown to have a fundamental trade-off relationship, to satisfy the maximum-distance-separable (MDS) property (i.e., any k out of n storage nodes can be accessed to recover the original file) of the system. Moreover, the authors of [10] obtained the capacity C, the maximum amount of reliably storable data, as a function of α and γ. The authors related the failure-repair process of a DSS with the multi-casting problem in network information theory, and exploited the fact that a cut-set bound is achievable by network coding [11]. Since the theoretical results of [10], explicit network coding schemes [12, 13, 14] which achieve the optimal (α, γ) pairs have also been suggested. These results are based on the assumption of homogeneous systems, i.e., each node has the same storage capacity and repair bandwidth.
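As a concrete illustration of the trade-off established in [10], the homogeneous capacity expression C = Σ_{i=0}^{k-1} min{α, (d - i)β} can be evaluated numerically. The following sketch assumes that standard form; the parameter values are hypothetical and chosen only for illustration.

```python
# Capacity of a homogeneous (non-clustered) DSS, per the cut-set
# bound of [10]: C = sum_{i=0}^{k-1} min(alpha, (d - i) * beta),
# where alpha is per-node storage and beta the per-helper traffic.
def capacity_homogeneous(alpha: float, beta: float, k: int, d: int) -> float:
    assert k <= d, "each newcomer contacts d >= k helper nodes"
    return sum(min(alpha, (d - i) * beta) for i in range(k))

# Hypothetical example with k = 5, d = 9 helpers: here every term is
# storage-limited, so the sum equals k * alpha = 10.0.
msr_like = capacity_homogeneous(alpha=2.0, beta=0.5, k=5, d=9)

# With very large alpha, every term is bandwidth-limited instead:
# (9 + 8 + 7 + 6 + 5) * 0.5 = 17.5.
mbr_like = capacity_homogeneous(alpha=100.0, beta=0.5, k=5, d=9)
```

Sweeping α for a fixed β (or vice versa) traces out the storage/bandwidth trade-off curve discussed above.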
However, in real data centers, storage nodes are dispersed into multiple clusters (in the form of disks or racks) [15, 7, 8], allowing high reliability against both node and rack failure events. In this clustered system, repairing a failed node gives rise to both intra-cluster and cross-cluster repair traffic. While current data centers have abundant intra-rack communication bandwidth, cross-rack communication is typically limited. According to [16], nearly a TB of cross-rack repair bandwidth is required every day in the Facebook warehouse, limiting cross-rack communication for foreground map-reduce jobs. Moreover, surveys [17, 18, 19] on network traffic within data centers show that cross-rack communication is oversubscribed; the available cross-rack communication bandwidth is typically much lower than the intra-rack bandwidth in practical systems. Thus, a new system model which reflects the imbalance between intra- and cross-cluster repair bandwidths is required.
I-A Main Contributions
This paper suggests a new system model for clustered DSS to reflect the clustered nature of real distributed storage systems wherein an imbalance exists between intra- and cross-cluster repair burdens. This model can be applied to not only large data centers, but also local networks of storage devices such as the sensor networks or home clouds which are expected to be prevalent in the IoT era. This model is also more general in the sense that when the intra- and cross-cluster repair bandwidths are set to be equal, the resulting structure reduces to the original DSS model of [10]. This paper only considers recovering a single node failure at a time, as in [10]. The main contributions of this paper can be seen as twofold: one is the derivation of a closed-form expression for capacity, and the other is the analysis on feasible sets of system resources which enable reliable storage.
I-A1 Closed-form Expression for Capacity
Under the setting of functional repair, the storage capacity of the clustered DSS is obtained as a function of the node storage capacity α, the intra-cluster repair bandwidth β_I and the cross-cluster repair bandwidth β_c. The existence of the cluster structure, manifested as the imbalance between intra- and cross-cluster traffic, makes the capacity analysis challenging; Dimakis' proof in [10] cannot be directly extended to handle the problem at hand. We show that symmetric repair (obtaining the same amount of information from each helper node) is optimal in the sense of maximizing capacity given the storage node size and total repair bandwidth, as also shown in [20] for the case of varying repair bandwidth across the nodes. However, we stress that in most practical scenarios, the need is greater for reducing the cross-cluster communication burden, and we show that this is possible at the cost of reduced overall storage capacity and/or increased intra-cluster repair bandwidth. Based on the derived capacity expression, we analyze how the storage capacity changes as a function of L, the number of clusters. It is shown that the capacity is asymptotically equivalent to a monotonically decreasing function of L.
I-A2 Analysis on Feasible Points
Given the need for reliably storing a file of size M, the set of required resources (node storage capacity α, intra-cluster repair bandwidth β_I and cross-cluster repair bandwidth β_c) which enables reliable storage is obtained in a closed-form solution. In the analysis, we introduce ε, a useful parameter which measures the ratio of the cross-cluster repair burden (per node) to the intra-cluster burden. This parameter represents how scarce the available cross-cluster bandwidth is, compared to the abundant intra-cluster bandwidth. We here stress that the special case of ε = 0 corresponds to the scenario where repair is done only locally via intra-cluster communication, i.e., when a node fails, the repair process requires intra-cluster traffic only, without any cross-cluster traffic. Thus, the analysis on the ε = 0 case provides guidance on network coding for data centers in scenarios where the available cross-cluster (cross-rack) bandwidth is very scarce.
Similar to the non-clustered case of [10], the required node storage capacity α and the required repair bandwidth γ show a trade-off relationship. In the trade-off curve, two extremal points, the minimum-bandwidth-regenerating (MBR) point and the minimum-storage-regenerating (MSR) point, have been further analyzed for various ε values. Moreover, from the analysis on the trade-off curve, it is shown that the minimum storage overhead α = M/k is achievable if and only if ε ≥ 1/(n - k). This implies that in order to reliably store a file of size M with minimum storage α = M/k, a sufficiently large cross-cluster repair bandwidth satisfying ε ≥ 1/(n - k) is required. Finally, for scenarios with abundant intra-cluster repair bandwidth, the minimum required cross-cluster repair bandwidth to reliably store the file is obtained as a function of the node storage capacity α.
I-B Related Works
Several researchers have analyzed practical distributed storage systems with a goal of reflecting the non-homogeneous nature of storage nodes [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. A heterogeneous model was considered in [20, 21], where the storage capacities and the repair bandwidths for newcomer nodes are generally non-uniform. Upper/lower capacity bounds for the heterogeneous DSS are obtained in [20]. An asymmetric repair process is considered in [22], coining the terms cheap and expensive nodes, based on the amount of data that can be transferred to any newcomer. The authors of [23] considered a flexible distributed storage system where the amount of information from helper nodes may be non-uniform, as long as the total repair bandwidth is bounded from above. The viewpoints taken in these works are different from ours in that we adopt a notion of cluster and introduce imbalance between intra- and cross-cluster repair burdens.
Recently, some researchers have considered the clustered structure of data centers [1, 24, 25, 26, 27, 28, 29, 30, 31]. Some recent works [24, 25, 26, 27] provided new system models for clustered DSSs and shed light on fundamental aspects of the suggested systems. In [24], the idea of [22] is developed into a two-rack system, by setting the communication burden within a rack much lower than the burden across different racks, similar to our analysis. However, the authors of [24] only considered systems with two racks, while the current paper considers a general setting of L racks (clusters), and provides a mathematical analysis of how the number of clusters (i.e., the dispersion of nodes) affects the capacity of clustered distributed storage. Similar to the present paper, the authors of [25, 26] obtained the capacity of clustered distributed storage, and provided capacity-achieving regenerating coding schemes. However, the coding schemes considered in [25, 26] do not satisfy the MDS property, and the capacity expression is obtained only for limited scenarios where the intra-cluster repair bandwidth is set to its maximum value. In contrast, the current paper provides the capacity expression for general values of parameters, and analyzes the behavior of capacity as a function of the ratio between intra- and cross-cluster repair bandwidths. Moreover, unlike in the previous work, the capacity-achieving coding schemes (whose existence is shown) here satisfy the MDS property. In [27], the security issue in clustered distributed storage systems is considered, and the maximum amount of securely storable data in the presence of passive eavesdroppers is obtained.
There have also been some recent works [28, 29, 30, 31] on network code design appropriate for clustered distributed storage. Motivated by the limited available cross-rack repair bandwidth in real data centers, the work of [28] provides a network coding scheme which minimizes the cross-rack bandwidth in clustered distributed storage systems. However, the suggested coding scheme is applicable only for some limited parameters and the minimum storage overhead setting. In contrast, the current paper provides the capacity expression for a general setting, and proves the existence of a capacity-achieving coding scheme. The authors of [29] proposed coding schemes tolerant to rack failure events in multi-rack storage systems, but did not address the imbalance between intra- and cross-cluster repair burdens in node failure events, which is an important aspect of the current paper. The authors of [30] consider the scenario of having grouped (clustered) storage nodes where nodes in the same group are more accessible to each other, compared to the nodes in other groups. However, the focus is different from the present paper: [30] focuses on code constructions which access the minimum amount of data (called the optimal access property), while the scope of the present paper is finding the optimal trade-off between the node storage capacity and the repair bandwidth, as a function of the ratio of intra- and cross-cluster communication burdens. Finally, a locally repairable code which can repair any node within each group is suggested in [31]; this code can suppress the inter-group repair bandwidth to zero. However, the coding scheme is suggested for a special case only, while the present paper provides the capacity expression for general parameter settings, and proves the existence of an optimal coding scheme.
Compared to the conference version [1] of the current work, this paper provides the formal proofs for the capacity expression, and obtains the feasible region for the 0 ≤ ε ≤ 1 setting (the reason why the present paper considers this regime is provided in Section III-B; only a special case is considered in [1]). The present paper also shows the behavior of capacity as a function of L, the number of clusters, and provides the sufficient and necessary condition on ε to achieve the minimum storage overhead α = M/k. Finally, the asymptotic behaviors of the MBR/MSR points are investigated in this paper, and the connection between what we call the intra-cluster repairable codes and the existing locally repairable codes [32] is revealed.
I-C Organization
This paper is organized as follows. Section II reviews preliminary materials about distributed storage systems and the information flow graph, an efficient tool for analyzing DSS. Section III proposes a new system model for the clustered DSS, and derives a closed-form expression for the storage capacity of the clustered DSS. The behavior of the capacity curves is also analyzed in this section. Based on the capacity expression, Section IV provides results on the feasible resource pairs which enable reliable storage of a given file. Further research topics on clustered DSS are discussed in Section V, and Section VI draws conclusions.
II Background
II-A Distributed Storage System
Distributed storage systems maintain reliability by means of erasure coding [33]. The original data file is spread across n potentially unreliable nodes, each with storage size α. When a node fails, it is regenerated by contacting d helper nodes and obtaining a particular amount of data, β, from each helper node. The amount of communication imposed by one failure event is called the repair bandwidth, denoted as γ. When the client requests a retrieval of the original file, assuming all failed nodes have been repaired, access to any k out of n nodes must guarantee a file recovery. The ability to recover the original data using any k out of n nodes is called the maximum-distance-separable (MDS) property. Distributed storage systems can be used in many applications such as large data centers, peer-to-peer storage systems and wireless sensor networks [10].
II-B Information Flow Graph
The information flow graph is a useful tool to analyze the amount of information flow from source to data collector in a DSS, as utilized in [10]. It is a directed graph consisting of three types of nodes: data source S, data collector DC, and storage nodes, as shown in Fig. 1. Each storage node can be viewed as consisting of an input-node and an output-node, which are responsible for the incoming and outgoing edges, respectively. The input-node and output-node are connected by a directed edge with capacity identical to the storage size α of the node.
Data from source S is stored into n nodes. This process is represented by edges going from S to the storage nodes, where each edge capacity is set to infinity. A failure/repair process in a DSS can be described as follows. When a node fails, a new node joins the graph by connecting edges from d surviving nodes, where each edge has capacity β. After all repairs are done, data collector DC chooses arbitrary k nodes to retrieve data, as illustrated by the edges connected from the chosen nodes with infinite edge capacity. Fig. 1 gives an example of an information flow graph representing a distributed storage system.
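The min-cut of such a graph can be computed with any generic max-flow routine. The sketch below builds one member of this family of graphs (a single failure with n = 4, d = 3, k = 2, α = 2, β = 1, all hypothetical example values, not the worst-case graph) and evaluates its max flow with Edmonds-Karp.

```python
from collections import defaultdict, deque

def max_flow(cap, s, t):
    """Edmonds-Karp max-flow; cap[u][v] holds residual capacities."""
    total = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:        # BFS for a shortest augmenting path
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:                     # no augmenting path left
            return total
        path, v = [], t                         # trace the path back to s
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)  # bottleneck capacity
        for u, v in path:                       # update the residual graph
            cap[u][v] -= push
            cap[v][u] += push
        total += push

INF = float("inf")
alpha, beta = 2, 1
cap = defaultdict(lambda: defaultdict(int))
for i in range(4):                  # initial nodes: S -> in_i -> out_i
    cap["S"][("in", i)] = INF
    cap[("in", i)][("out", i)] = alpha
for j in (1, 2, 3):                 # node 0 fails; newcomer 4 draws beta from each survivor
    cap[("out", j)][("in", 4)] = beta
cap[("in", 4)][("out", 4)] = alpha
for i in (4, 1):                    # DC contacts k = 2 nodes (newcomer included)
    cap[("out", i)]["DC"] = INF

flow_value = max_flow(cap, "S", "DC")   # min-cut value of this particular graph
```

For this graph the flow is limited to α through the contacted original node plus min(α, dβ) through the newcomer, i.e., 2 + 2 = 4.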
II-C Notation used in the paper
This paper requires many notations related to graphs, because it deals with information flow graphs. Here we provide the definition of each notation used in the paper. For the given system parameters, we denote by 𝒢 the set of all possible information flow graphs. A graph is denoted as G = (V, E), where V is the set of vertices and E is the set of edges in the graph. For a given graph G, we call a set c of edges a cut-set [34] if it satisfies the following: every directed path from S to DC includes at least one edge in c. An arbitrary cut-set is usually denoted as (U, Ū), where U and Ū (the complement of U) satisfy the following: the set of edges from U to Ū is the given cut-set c. The set of all cut-sets available in G is denoted as C(G). For a graph G and a cut-set c, we denote the sum of edge capacities of the edges in c as w(c), which is called the cut-value of c.
A vector is denoted as v using bold notation. For a vector v, the transpose of the vector is denoted as vᵀ. Sets are written with braces, and sequences with parentheses. For given sequences f(n) and g(n), we say that "f(n) is asymptotically equivalent to g(n)" [35] if and only if

lim_{n→∞} f(n)/g(n) = 1.    (1)
We also utilize the following useful notation: for a positive integer n, [n] denotes the set {1, 2, …, n}, while for a non-positive integer n, we define [n] as the empty set. Each storage node is represented either as a pair of input/output nodes as defined in Section II-B, or as defined in (A.2). Finally, the important parameters used in this paper are summarized in Table I.
| Symbol | Description |
|---|---|
| n | number of storage nodes |
| k | number of DC-contacting nodes |
| L | number of clusters |
| n_I | number of nodes in a cluster (n_I = n/L) |
| d_I | number of intra-cluster helper nodes (d_I = n_I - 1) |
| d_c | number of cross-cluster helper nodes (d_c = n - n_I) |
| 𝔽 | base field which contains each symbol |
| α | storage capacity of each node |
| β_I | intra-cluster repair bandwidth (per node) |
| β_c | cross-cluster repair bandwidth (per node) |
| γ_I | intra-cluster repair bandwidth (γ_I = d_I β_I) |
| γ_c | cross-cluster repair bandwidth (γ_c = d_c β_c) |
| γ | repair bandwidth (γ = γ_I + γ_c) |
| ε | ratio of β_c to β_I (0 ≤ ε ≤ 1) |
| σ | ratio of L to n_I (dispersion factor) |
| γ_c/γ | ratio of γ_c to γ |
III Capacity of Clustered DSS
III-A Clustered Distributed Storage System
A distributed storage system with multiple clusters is shown in Fig. 2. Data from source S is stored at n nodes which are grouped into L clusters. The number of nodes in each cluster is fixed and denoted as n_I = n/L. The storage size of each node is denoted as α. When a node fails, a newcomer node is regenerated by contacting d_I helper nodes within the same cluster, and d_c helper nodes from other clusters. This paper considers functional repair [33] in the regeneration process; the newcomer node may store different content from that of the failed node, while maintaining the MDS property of the code. The amount of data a newcomer node receives from within the same cluster is γ_I (each helper node equally contributing β_I), and that from other clusters is γ_c (each helper node equally contributing β_c). Fig. 3 illustrates an example of an information flow graph representing the repair process in a clustered DSS.
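The per-failure traffic implied by this model is easy to tally. The following small sketch uses the maximum numbers of helpers described below in Section III-B; the parameter values are hypothetical.

```python
def repair_traffic(n: int, L: int, beta_i: float, beta_c: float):
    """Traffic for repairing one node when the maximum numbers of helpers
    are used: n_I - 1 intra-cluster and n - n_I cross-cluster helpers."""
    assert n % L == 0, "equal-sized clusters assumed"
    n_i = n // L                        # nodes per cluster
    gamma_i = (n_i - 1) * beta_i        # total intra-cluster traffic
    gamma_c = (n - n_i) * beta_c        # total cross-cluster traffic
    return gamma_i, gamma_c, gamma_i + gamma_c

# e.g. n = 12 nodes in L = 3 clusters, cheap intra / expensive cross links:
# (4 - 1) * 1.0 = 3.0 intra, (12 - 4) * 0.25 = 2.0 cross, gamma = 5.0
traffic = repair_traffic(12, 3, beta_i=1.0, beta_c=0.25)
```

Note how the cross-cluster term involves many more helpers (n - n_I) than the intra-cluster term (n_I - 1), which is why the per-node bandwidths β_I and β_c are accounted for separately.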
III-B Assumptions for the System
We assume that d_I and d_c have the maximum possible values (d_I = n_I - 1, d_c = n - n_I), since this is the capacity-maximizing choice, as formally stated in the following proposition. The proof of the proposition is in Appendix G-A.
Proposition 1.
Consider a clustered distributed storage system with given γ_I and γ_c. Then, setting both d_I and d_c to their maximum values maximizes the storage capacity.
Note that the authors of [10] already showed that in the non-clustered scenario with a given repair bandwidth γ, maximizing the number of helper nodes is the capacity-maximizing choice. Here, we are saying that a similar property also holds for the clustered scenario considered in the present paper. Under the setting of the maximum number of helper nodes, the overall repair bandwidth for a failure event is denoted as

γ = γ_I + γ_c = (n_I - 1)β_I + (n - n_I)β_c.    (2)
The data collector DC contacts any k out of the n nodes in the clustered DSS. Given that the typical intra-cluster communication bandwidth is larger than the cross-cluster bandwidth in real systems, we assume β_I ≥ β_c throughout the present paper; this assumption limits our interest to 0 ≤ ε ≤ 1. Moreover, motivated by the security issue, we assume that a file cannot be retrieved entirely by contacting any single cluster having n_I nodes. Thus, the number k of nodes contacted by the data collector satisfies

n_I < k ≤ n.    (3)
We also assume

L ≥ 2,    (4)

which holds for most real DSSs. Usually, all storage nodes cannot be squeezed into a single cluster, i.e., L = 1 rarely happens in practical systems, to prevent losing everything when the cluster is destroyed. Note that many storage systems [7, 8, 16], including those of Facebook, use n_I = 1, i.e., every storage node resides in a different rack (cluster), to tolerate rack failure events. Finally, according to [16], nearly all data recoveries in real systems deal with single node recovery. In other words, the portion of simultaneous multiple-node failure events is small. Therefore, the present paper focuses on single node failure events.
III-C The closed-form solution for Capacity
Consider a clustered DSS with fixed (n, k, L) values. In this model, we want to find the set of feasible parameters (α, β_I, β_c) which enables storing data of size M. In order to find the feasible set, a min-cut analysis on the information flow graph is required, similar to [10]. Depending on the failure-repair process and the nodes contacted by DC, various information flow graphs can be obtained.
Let 𝒢 be the set of all possible flow graphs. Consider a graph in 𝒢 with the minimum min-cut, the construction of which is specified in Appendix A. Based on the max-flow min-cut theorem in [11], the maximum information flow from source to data collector for an arbitrary graph in 𝒢 is greater than or equal to this minimum min-cut value C, which is called the capacity of the system. In order to send data of size M from the source to the data collector, C ≥ M should be satisfied. Moreover, if C ≥ M is satisfied, there exists a linear network coding scheme [11] to store a file of size M. Therefore, the set of (α, β_I, β_c) points which satisfies C ≥ M is feasible in the sense of reliably storing the original file of size M. Now, we state our main result in the form of a theorem which offers a closed-form solution for the capacity of the clustered DSS. Note that setting β_I = β_c reduces to the capacity of the non-clustered DSS obtained in [10].
Theorem 1.
The capacity of the clustered distributed storage system with parameters (n, k, L, α, β_I, β_c) is
(5)

where

(6)

(7)
The proof is in Appendix A. Note that the parameters used in the statement of Theorem 1 have the following property, the proof of which is in Appendix G-B.
Proposition 2.
For every index pair in the valid range, we have
(8)
Moreover,
(9)
holds.
III-D Relationship between C and ε
In this subsection, we analyze the capacity of a clustered DSS as a function of an important parameter
ε = β_c / β_I,    (10)
the cross-cluster repair burden per unit intra-cluster repair burden. In Fig. 4, the capacity is plotted as a function of ε. From (2), the total repair bandwidth can be expressed as
γ = {(n_I - 1) + ε(n - n_I)} β_I.    (11)
Using this expression, the capacity is expressed as
(12)
For a fair comparison across various ε values, the capacity is calculated for a fixed (n, k, L, α, γ) set. The capacity is an increasing function of ε, as shown in Fig. 4. This implies that for given resources α and γ, allowing a larger ε (until it reaches 1) is always beneficial, in terms of storing a larger file. For example, under the setting in Fig. 4, the symmetric repair case ε = 1 (i.e., β_c = β_I) can store a larger file than any smaller ε with the same resources. This result is consistent with the previous work on asymmetric repair in [20], which proved that symmetric repair maximizes capacity. Therefore, when the total communication amount is fixed, a loss of storage capacity is the cost we need to pay in order to reduce the communication burden across different clusters.
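For a fixed total budget γ, the per-node bandwidths implied by a choice of ε follow directly from the expression γ = {(n_I - 1) + ε(n - n_I)} β_I; a quick sketch with hypothetical values:

```python
def per_node_bandwidths(gamma: float, eps: float, n: int, L: int):
    """Solve gamma = ((n_I - 1) + eps * (n - n_I)) * beta_I for beta_I,
    then beta_c = eps * beta_I, since eps = beta_c / beta_I."""
    n_i = n // L
    beta_i = gamma / ((n_i - 1) + eps * (n - n_i))
    return beta_i, eps * beta_i

# For a fixed gamma, shrinking eps shifts the whole repair burden onto
# the intra-cluster links (and, per Fig. 4, lowers capacity):
sym = per_node_bandwidths(gamma=5.0, eps=1.0, n=12, L=3)   # symmetric repair
loc = per_node_bandwidths(gamma=5.0, eps=0.0, n=12, L=3)   # local repair only
```

With n = 12, L = 3, the symmetric case gives β_I = β_c = 5/11, while ε = 0 concentrates the same γ = 5 onto the n_I - 1 = 3 intra-cluster helpers (β_I = 5/3, β_c = 0).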
III-E Relationship between C and L
In this subsection, we analyze the capacity of a clustered DSS as a function of L, the number of clusters. For a fair comparison, the (n, k, α, γ) values are fixed when calculating the capacity. In Fig. 5, capacity curves for two scenarios are plotted over a range of L values. First, the solid line corresponds to the scenario where the system has abundant cross-rack bandwidth resources. In this ideal scenario, which does not suffer from the over-subscription problem, the system can store the same amount of data irrespective of the dispersion of nodes.
However, consider a practical situation where the available cross-rack bandwidth is scarce compared to the intra-rack bandwidth, i.e., ε is small. The dashed line in Fig. 5 corresponds to this scenario, where the system does not have enough cross-rack bandwidth resources. In this practical scenario, reducing L (i.e., gathering the storage nodes into a smaller number of clusters) increases the capacity. However, note that sufficient dispersion of data into a fair number of clusters is typically desired, in order to guarantee the reliability of storage under rack-failure events. Finding the optimal number of clusters in this trade-off relationship remains an important topic for future research.
In Fig. 5, the capacity is a monotonically decreasing function of L when the system suffers from an over-subscription problem. However, in general parameter settings, the capacity is not always a monotonically decreasing function of L. Theorem 2 illustrates the behavior of the capacity as L varies, focusing on the special case of ε = 0. Before formally stating the next main result, we need to define
σ = L / n_I,    (13)
the dispersion factor of a clustered storage system, as illustrated in Fig. 6. In the two-dimensional representation of a clustered distributed storage system, L represents the number of rows (clusters), while n_I represents the number of columns (nodes in each cluster). The dispersion factor σ is the ratio of the number of rows to the number of columns. If n increases for a fixed σ, then L grows and the nodes become more dispersed into multiple clusters.
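Since L · n_I = n, fixing the dispersion factor forces the number of clusters to grow as the square root of n; a one-line check (using the σ notation above):

```python
import math

def clusters_for_dispersion(n: int, sigma: float) -> float:
    """L satisfying L * n_I = n and L / n_I = sigma, i.e. L = sqrt(sigma * n)."""
    return math.sqrt(sigma * n)

# Nodes become more dispersed as n grows under a fixed dispersion factor:
growth = [clusters_for_dispersion(n, 1.0) for n in (25, 100, 400)]  # 5, 10, 20 clusters
```

This is the scaling under which the asymptotic statements of this subsection are made.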
Now we state our second main result, which is about the behavior of the capacity C versus the number of clusters L.
Theorem 2.
Consider the case of ε = 0 when the dispersion factor σ and k are fixed. In the asymptotic regime of large n, the capacity is asymptotically equivalent to
(14)
a monotonically decreasing function of L. This can also be stated as
(15)
as n → ∞ for a fixed σ.
The proof of Theorem 2 is based on the following two lemmas.
Lemma 1.
IV Discussion on Feasible (α, β_I, β_c)
In the previous section, we obtained the capacity of the clustered DSS. This section analyzes the feasible (α, β_I, β_c) points which satisfy C ≥ M for a given file size M. Using

γ = (n_I - 1)β_I + (n - n_I)β_c

in (2), the behavior of the feasible set of (α, γ) points can be observed. Essentially, the feasible points demonstrate a trade-off relationship. Two extreme points of the trade-off, the minimum storage regenerating (MSR) point and the minimum bandwidth regenerating (MBR) point, have been analyzed. Moreover, a special family of network codes which satisfy β_c = 0, which we call the intra-cluster repairable codes, is compared with the locally repairable codes considered in [32]. Finally, the set of feasible points is discussed for the case when the system allows the maximum intra-cluster repair bandwidth.
IV-A Set of Feasible (α, β_I, β_c) Points
We provide a closed-form solution for the set of feasible (α, β_I, β_c) points which enables reliable storage of data of size M. Based on the range of ε defined in (10), the set of feasible points shows different behaviors, as stated in Corollary 1.
Corollary 1.
Consider a clustered DSS for storing data of size M, with parameters satisfying the assumptions of Section III-B. For any node storage capacity at or above a threshold depending on γ, the data can be reliably stored, i.e., C ≥ M, while it is impossible to reliably store the data below the threshold. The threshold function can be obtained in a closed form.
An example of the trade-off results of Corollary 1 is illustrated in Fig. 7, for various ε values. Here, the ε = 1 (i.e., β_c = β_I) case corresponds to the symmetric repair in the non-clustered scenario [10]. The plot for ε = 0 (or β_c = 0) shows that the cross-cluster repair bandwidth can be reduced to zero with extra resources (α or β_I), where the amount of required resources is specified in Corollary 1. Note that in the case of ε = 0, the storage system is completely localized, i.e., nodes can be repaired exclusively from within their cluster. From Fig. 7, we can confirm that as ε decreases, extra resources (α or γ) are required to reliably store the given data. Moreover, Corollary 1 suggests a mathematically interesting result, stated in the following theorem, the proof of which is in Appendix B.
Theorem 3.
A clustered DSS can reliably store a file of size M with the minimum storage overhead α = M/k if and only if

ε ≥ 1/(n - k).    (27)
Note that α = M/k is the minimum storage overhead which can satisfy the MDS property, as stated in [10]. The implication of Theorem 3 is shown in Fig. 7. Under the setting of Fig. 7, data can be reliably stored with the minimum storage overhead for ε satisfying (27), while it is impossible to achieve the minimum storage overhead otherwise. Finally, since reducing the cross-cluster repair burden γ_c is regarded as a much harder problem compared to reducing the intra-cluster repair burden γ_I, we also plot feasible (α, γ_c) pairs for various ε values in Fig. 8. The plot for ε = 0 obviously has zero γ_c, while the ε > 0 cases show a trade-off relationship. As ε increases, the minimum γ_c value increases gradually.
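The threshold of Theorem 3 is easy to check programmatically; this sketch assumes the ε ≥ 1/(n - k) form stated above, with hypothetical parameter values:

```python
def min_storage_feasible(eps: float, n: int, k: int) -> bool:
    """True when the minimum storage alpha = M/k is achievable,
    i.e. when eps >= 1 / (n - k) (the condition of Theorem 3)."""
    assert 0 <= eps <= 1 and 0 < k < n
    return eps >= 1.0 / (n - k)

# With n = 10, k = 8 the threshold is 1/2, so eps = 0.4 falls short;
# enlarging n to 20 relaxes the threshold to 1/12 and 0.4 suffices.
tight = min_storage_feasible(0.4, 10, 8)
relaxed = min_storage_feasible(0.4, 20, 8)
```

This also previews the asymptotic statement of Section IV-B: as n grows with k fixed, the required ε shrinks toward zero.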
IV-B Minimum-Bandwidth-Regenerating (MBR) point and Minimum-Storage-Regenerating (MSR) point
According to Corollary 1, the set of feasible (α, γ) points shows a trade-off curve as in Fig. 9, for arbitrary ε settings. Here we focus on two extremal points: the minimum-bandwidth-regenerating (MBR) point and the minimum-storage-regenerating (MSR) point. As originally defined in [10], we call the point on the trade-off with minimum bandwidth the MBR point. Similarly, we call the point with minimum storage the MSR point222Among multiple points with minimum γ, the point having the smallest α is called the MBR point. Similarly, among points with minimum α, the point with the minimum γ is called the MSR point.. Let (α_msr, γ_msr) be the parameter pair of the MSR point for a given ε. Similarly, define (α_mbr, γ_mbr) as the parameter pair of the MBR point. According to Corollary 1, the explicit expressions for the MSR and MBR points are as in the following corollary, the proof of which is given in Appendix F-B.
Corollary 2.
For a given ε, we have
(28)

(29)
Now we compare the MSR and MBR points for the two extreme cases of ε = 0 and ε = 1. Using the dispersion factor σ defined in (13), the asymptotic behaviors of the MBR and MSR points are illustrated in the following theorem, the proof of which is in Appendix C.
Theorem 4.
Consider the MSR point and the MBR point for the extreme cases ε = 0 and ε = 1. The minimum node storage for ε = 0 is asymptotically equivalent to the minimum node storage for ε = 1, i.e.,
(30)
as n → ∞ for arbitrary fixed σ and k. Moreover, the MBR point for ε = 0 approaches the MBR point for ε = 1, i.e.,
(31)
as σ → 1. The ratio between the MBR repair bandwidths for ε = 0 and ε = 1 is expressed as
(32)
According to Theorem 4, the minimum storage for ε = 0 can achieve M/k as n → ∞ with fixed k. This result coincides with the result of Theorem 3. According to Theorem 3, the sufficient and necessary condition for achieving the minimum storage of α = M/k is
ε ≥ 1/(n - k).    (33)
As n increases with a fixed k, the lower bound on ε reduces, so that in the asymptotic regime, even ε = 0 can achieve the minimum storage α = M/k.
Moreover, Theorem 4 states that the MBR point for ε = 0 approaches the MBR point for ε = 1 as σ goes to 1. Fig. 11 provides two MBR coding schemes with n = 6 nodes in L = 2 clusters, which have different ε values; the coding scheme in Fig. 11a satisfies ε = 1, while the one in Fig. 11b satisfies ε = 0. The RSKR coding scheme [12] is applied to the six nodes in Fig. 11a. Each node (illustrated as a rectangular box) contains five symbols, where each symbol consists of two sub-symbols. Note that any symbol is shared by exactly two nodes in Fig. 11a, which is due to the property of RSKR coding. This system can reliably store fifteen symbols, or thirty sub-symbols, since it satisfies two properties, the exact repair property and the data recovery property, as illustrated below. First, when a node fails, the five other nodes transmit five symbols (one distinct symbol from each node), which exactly regenerates the failed node. Second, we can retrieve the data, i.e., all thirty sub-symbols, by contacting any five nodes. In Fig. 11b, each node contains two symbols, where each symbol consists of five sub-symbols. Note that in Fig. 11b, any symbol is shared by exactly two nodes which reside in the same cluster. This is because we applied RSKR coding within each cluster in the system of Fig. 11b. This system can reliably store six symbols, or thirty sub-symbols, since it satisfies the exact repair property and the data recovery property.
Note that both DSSs in Fig. 11 reliably store sub-symbols, by using the node capacity of sub-symbols and the repair bandwidth of sub-symbols. However, the former system requires cross-cluster repair bandwidth for each node failure event, while the latter system requires cross-cluster repair bandwidth. For example, if the leftmost node of the cluster fails in Fig. 11a, then four sub-symbols are transmitted within that cluster, while six sub-symbols are transmitted from the cluster. In the case of in Fig. 11b, ten sub-symbols are transmitted within the cluster and no sub-symbols are transmitted across the clusters, when the leftmost node of the cluster fails. Thus, the transition from the former system (Fig. 11a) to the latter system (Fig. 11b) reduces the cross-cluster repair bandwidth to zero, while maintaining the storage capacity and the required resource pair . Likewise, we can reduce the cross-cluster repair bandwidth to zero while maintaining the storage capacity, in the case of .
Note that for , from (29). Thus, the result of (32) in Theorem 4 can be depicted as in Fig. 12 in the asymptotic regime of large . According to Fig. 12, intra-cluster only repair ( or ) is possible by using additional resources ( and ) in the portion, compared to the symmetric repair () case.
IV-C Intra-cluster Repairable Codes versus Locally Repairable Codes [32]
Here, we define a family of network coding schemes for clustered DSS, which we call the intra-cluster repairable codes. In Corollary 1, we considered DSSs with , which can repair any failed node by using intra-cluster communication only. The optimal trade-off curve which satisfies is illustrated as the solid line with cross markers in Fig. 7. Each point on the curve is achievable (i.e., there exists a network coding scheme), according to the result of [11]. We call the network coding schemes for the points on the curve of the intra-cluster repairable codes, since these coding schemes can repair any failed node by using intra-cluster communication only. The relationship between the intra-cluster repairable codes and the locally repairable codes (LRC) of [32] is investigated in Theorem 5, the proof of which is given in Appendix D. Note that according to the definition in [32], an -LRC encodes a file of size into coded symbols, where each symbol contains bits. In addition, any coded symbol of the LRC is regenerated by accessing at most other symbols (i.e., the code has repair locality of ), while the minimum distance of the code is .
IV-D Required for a given
Here we focus on the following question: when the available intra-cluster repair bandwidth is abundant, how much cross-cluster repair bandwidth is required to reliably store file ? We consider scenarios when the intra-cluster repair bandwidth (per node) has its maximum value, i.e., . Under this setting, Theorem 6 specifies the minimum required which satisfies . The proof of Theorem 6 is given in Appendix E.
Theorem 6.
Suppose the intra-cluster repair bandwidth is set at the maximum value, i.e., . For a given node capacity , the clustered DSS can reliably store data if and only if where
| (35) | ||||
| (36) | ||||
| (37) |
Fig. 13 provides an example of the optimal trade-off relationship between and , explained in Theorem 6. For , the cross-cluster burden can be reduced to zero. However, as decreases from , the system requires a larger value. For example, if in Fig. 13, is required to satisfy . Thus, for each node failure event, a cross-cluster repair bandwidth of is required. Theorem 6 provides an explicit equation for the cross-cluster repair bandwidth we need to pay, in order to reduce node capacity .
V Further Comments & Future Works
V-A Explicit coding schemes for clustered DSS
According to the part I proof of Theorem 1, there exists an information flow graph which has the min-cut value of , the capacity of clustered DSS. Thus, according to [11], there exists a linear network coding scheme which achieves capacity . Although the existence of a coding scheme is verified, explicit network coding schemes which achieve capacity need to be specified for implementing practical systems. Recently, under the setting of clustered DSS modeled in the present paper, MBR codes for all are constructed in [36] and MSR codes for limited parameters are designed in [37]. Explicit code construction for general parameters and/or construction of codes that require a small field size are interesting remaining issues.
V-B Optimal number of clusters
According to Theorem 2, capacity is asymptotically a monotonically decreasing function of , the number of clusters. Thus, reducing the number of clusters (i.e., gathering storage nodes into a smaller number of clusters) increases storage capacity. However, as mentioned in Section III-E, we typically want a sufficiently large to tolerate the failure of a cluster. The remaining question is then to find the optimal which allows not only sufficiently large storage capacity but also tolerance to cluster failures. We regard this problem as a future research topic, the solution to which will provide guidance on strategies for distributing storage nodes into multiple clusters.
V-C Extension to general settings
The present paper assumed a maximum helper node setting, and , since it maximizes the capacity as stated in Proposition 1. However, waiting for all helper nodes gives rise to a latency issue. If we reduce the number of helper nodes and , low-latency repair would be possible, while the achievable storage capacity decreases. Thus, obtaining the capacity expression for general settings and characterizing the trade-off between capacity and latency for various values remain as future work.
V-D Scenarios of aggregating the helper data within each cluster
In recent work [28], [38] on multi-rack (multi-cluster) distributed storage, the authors discuss aggregation of repair data leaving a given cluster. The idea is to allow aggregation and compression of all helper data leaving each cluster to aid reconstruction taking place in some other cluster containing a failed node. This type of repair link aggregation has been shown to reduce the cross-cluster repair burden in [28], [38]. We expect that the same method would also change the trade-off picture for our clustered model. This is certainly an interesting and important topic to investigate, but careful analysis including the effect of security breaches on links will provide a more complete assessment of the merits and potential perils of repair link aggregation. We leave this as a future endeavor.
VI Conclusion
This paper considered a practical distributed storage system where storage nodes are dispersed into several clusters. Noticing that the traffic burdens of intra- and cross-cluster communications are typically different, a new system model for clustered distributed storage systems is suggested. Based on the cut-set bound analysis of the information flow graph, the storage capacity of the suggested model is obtained in closed form, as a function of three main resources: node storage capacity , intra-cluster repair bandwidth and cross-cluster repair bandwidth . It is shown that the asymmetric repair () degrades capacity, which is the cost for lifting the cross-cluster repair burden. Moreover, in the asymptotic regime of a large number of storage nodes, capacity is shown to be asymptotically equivalent to a monotonic decreasing function of , the number of clusters. Thus, reducing (i.e., gathering nodes into fewer clusters) is beneficial for increasing capacity, although we would typically need to guarantee sufficiently large to tolerate rack failure events.
Using the capacity expression, we obtained the feasible set of () triplets which satisfy , i.e., for which it is possible to reliably store file by using the resource value set (). The closed-form solution for the feasible set shows different behavior depending on , the ratio of cross- to intra-cluster repair bandwidth. It is shown that the minimum storage of is achievable if and only if . Moreover, in the special case of , we can construct a reliable storage system without using cross-cluster repair bandwidth. A family of network codes which enable , called the intra-cluster repairable codes, has been shown to be a class of the locally repairable codes defined in [32].
Appendix A Proof of Theorem 1
Here, we prove Theorem 1. First, denote the right-hand-side (RHS) of (5) as
| (A.1) |
For other notations used in this proof, refer to subsection II-C. The proof proceeds in two parts.
Part I. Show an information flow graph and a cut-set such that :
Consider the information flow graph illustrated in Fig. 14, which is obtained by the following procedure. First, data from source node is distributed into nodes labeled from to . As mentioned in Section II-B, the storage node consists of an input-node and an output-node . Second, storage node fails and is regenerated at the newcomer node for . The newcomer node connects to surviving nodes to regenerate . Third, data collector node contacts to retrieve data. This whole process is illustrated in the information flow graph .
To specify , here we determine the 2-dimensional location of the newcomer nodes . First, consider the 2-dimensional structure representation of clustered distributed storage, illustrated in Fig. 15. In this figure, each row represents a cluster, and each node is represented as a 2-dimensional point for and . The symbol denotes the node at point. Here we define the set of nodes,
| (A.2) |
For , consider selecting the newcomer node as
| (A.3) |
where
| (A.4) | ||||
| (A.5) |
and used in the method is defined in (7). The locations of the newcomer nodes selected by this method are illustrated in Fig. 16. Moreover, for the case, the newcomer nodes are depicted in Fig. 17. In these figures, the node with a number inside represents the newcomer node labeled as .
For the given graph , now we consider a cut-set defined as below. The cut-set can be defined by specifying and (complement of ), which partition the set of vertices in . First, let for and for . For , the input node is included in either or , depending on the condition specified in the next paragraph. Moreover, let and . See Fig. 18.
Let . For , let be the sum of capacities of edges from to . If , then we include in . Otherwise, we include in . Then, the cut-set has the cut-value of
| (A.6) |
All that remains is to show that (A.6) is equal to the expression in (A.1). In other words, we will obtain the expression for .
Recall that in the generation process of , any newcomer node connects to helper nodes to regenerate . Among the helper nodes, the nodes reside in the same cluster as , while the nodes are in other clusters. From our system setting in Section III-A, the helper nodes in the same cluster as the failed node help by , while the helper nodes in other clusters help by . Therefore, the total repair bandwidth to regenerate any failed node is
| (A.7) |
as in (2).
The newcomer node connects to , all of which are included in . Therefore, holds. Next, connects to nodes from and one node from . Define variable as the repair bandwidth from to . Then, . From (A.3), we have and . Therefore, and are in different clusters, which results in . Therefore,
In general, connects to nodes from , and nodes from . Thus, for can be expressed as where
Recall Fig. 16. For arbitrary newcomer node , the set contains nodes which reside in the same cluster as , and nodes in other clusters. Therefore, can be expressed as
where is defined in (A.4). Combined with (A.7) and (A.5), we get
Then, (A.6) can be expressed as
| (A.8) |
where . From the definition of in (A.4), we have
Thus, for . Therefore, (A) can be expressed as
which is identical to in (A.1), where used in this equation is defined in (6). Therefore, the specified information flow graph and the specified cut-set satisfy .
Part II. Show that for every information flow graph and for every cut-set , the cut-value is greater than or equal to in (A.1). In other words, , we have .
The proof is divided into 2 sub-parts: Part II-1 and Part II-2.
Part II-1. Show that we have where is in (A.14):
Consider an arbitrary information flow graph and an arbitrary cut-set of the graph . Denote the cut-set as . Consider an output node connected to . If , then the cut-value is infinity, which is a trivial case for proving . Therefore, the output nodes connected to are assumed to be in . In other words, at least output nodes exist in . Note that every directed acyclic graph can be topologically sorted [34], where vertex is followed by vertex if there exists a directed edge from to . Consider labeling the topologically first output nodes in as . Similar to the notation for a storage node in Section II-B, we denote the storage node which contains as . Then, the set of ordered tuples can be represented as
| (A.9) |
We also define , the sum of capacities of edges from to . See Fig. 19.
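The labeling of output nodes above relies on the fact that every directed acyclic graph can be topologically sorted. A standard sketch via Kahn's algorithm, using a generic vertex/edge-list representation (not the paper's specific information flow graph construction):

```python
from collections import deque

def topological_sort(vertices, edges):
    """Kahn's algorithm: repeatedly emit a vertex whose remaining
    in-degree is zero; succeeds exactly when the graph is a DAG."""
    indeg = {v: 0 for v in vertices}
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in vertices if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    assert len(order) == len(vertices), "graph contains a cycle"
    return order

# a toy flow graph: source -> two storage nodes -> data collector
order = topological_sort(["s", "x1", "x2", "dc"],
                         [("s", "x1"), ("s", "x2"), ("x1", "dc"), ("x2", "dc")])
assert order[0] == "s" and order[-1] == "dc"
```

In the proof, this ordering is what justifies picking the "topologically first" output nodes in the source side of the cut: every helper of the first such node must precede it in the order.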
If , then the cut-set should include the edge from to , which has the edge capacity . Otherwise (i.e., ), the cut-set should include the edges from to . If node is directly connected to the source node , the cut-value is infinity (trivial case for proving ). Therefore, node is assumed to be a newcomer node helped by helper nodes. Note that all helper nodes of are in , since is the topologically first output node in . Thus, the cut-set should include the edges from to , where the sum of capacities of these edges is
If , then the cut-set should include the edge from to , which has the edge capacity . Otherwise (i.e., ), the cut-set should include the edges from to . As we discussed in the case of , we can assume is a newcomer node helped by helper nodes. Since is the topologically second node in , it may have one helper node ; at least helper nodes in help to generate . Note that the total amount of data coming into is , while the amount of information coming from to , denoted , is as follows: if and are in the same cluster, , otherwise . Recall that the cut-set should include the edges from to . The sum of capacities of these edges is
while the equality holds if and only if helps . In a similar way, for , can be bounded as
| (A.10) |
where
| (A.11) | ||||
| (A.12) |
The equality in (A.10) holds if and only if helps for .
Thus, contributes at least to the cut value, for . In summary, for arbitrary graph , an arbitrary cut-set has cut-value of at least :
| (A.13) |
Note that depends on the relative position of , which is determined when an arbitrary information flow graph and arbitrary cut-set are specified. This relationship is illustrated in Fig. 20. Therefore, we define
| (A.14) |
for arbitrary and arbitrary . Combining (A.13) and (A.14) completes the proof part II-1.
Part II-2. :
Assume that and are fixed. See Fig. 20. Note that for a given graph and a cut-set with , the sequence of topologically first output nodes in is determined. Moreover, for a given sequence , we have a fixed , which determines in (A.14). Thus, can be obtained by finding the optimal sequence which minimizes . It is identical to finding the optimal , the sequence of different nodes out of existing nodes in the system. Therefore, based on (A.14) and (A.11), we have
| (A.15) |
where
| (A.16) |
holds as defined in (A.12), and is defined in (A). In order to obtain the solution for the RHS of (A.15), all we need to do is find the optimal sequence of different nodes, which can be divided into two sub-problems: finding the optimal way of selecting nodes out of nodes, and finding the optimal order of the selected nodes. Note that there are selection methods and ordering methods. Each selection method can be assigned to a selection vector defined in Definition 1, and each ordering method can be assigned to an ordering vector defined in Definition 2.
First, we define a selection vector for a given .
Definition 1.
Assume arbitrary nodes are selected as . Label each cluster by the number of selected nodes in descending order. In other words, the cluster contains the maximum number of selected nodes, and the cluster contains the minimum number of selected nodes. Under this setting, define the selection vector where is the number of selected nodes in the cluster.
Fig. 21 shows an example of selection vector corresponding to the selected nodes . From the definition of the selection vector, the set of possible selection vectors can be specified as follows.
Note that even though different selections exist, the values in (A.11) are only determined by the corresponding selection vector . This is because depends only on the relative positions of , whether they are in the same cluster or in different clusters. Therefore, comparing the values of all possible selection vectors is enough; it is not necessary to compare the values of selection methods. Now, we define the ordering vector for a given selection vector .
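The compression from selection methods to selection vectors can be made concrete: a selection vector is just the per-cluster count profile sorted in descending order, and the set of possible selection vectors consists of the non-increasing vectors with bounded entries and fixed sum. A sketch with hypothetical function names and small illustrative parameters:

```python
def selection_vector(selected_clusters, L):
    """Selection vector (Definition 1): per-cluster counts of the
    selected nodes, sorted in descending order (clusters relabeled)."""
    counts = [0] * L
    for c in selected_clusters:
        counts[c] += 1
    return sorted(counts, reverse=True)

def all_selection_vectors(k, L, n_I):
    """Enumerate every possible selection vector: non-increasing
    length-L vectors with entries in [0, n_I] summing to k."""
    def rec(remaining, slots, cap):
        if slots == 0:
            return [[]] if remaining == 0 else []
        vecs = []
        for s in range(min(cap, remaining), -1, -1):
            for tail in rec(remaining - s, slots - 1, s):
                vecs.append([s] + tail)
        return vecs
    return rec(k, L, n_I)

# five nodes picked from four clusters: two in cluster 0, two in
# cluster 2, one in cluster 3
assert selection_vector([0, 0, 2, 2, 3], L=4) == [2, 2, 1, 0]
# four nodes from three clusters of two nodes each
assert all_selection_vectors(4, 3, 2) == [[2, 2, 0], [2, 1, 1]]
```

Many distinct selections collapse onto each vector, which is exactly why comparing selection vectors suffices in the argument above.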
Definition 2.
Let the locations of nodes be fixed with a corresponding selection vector . Then, for arbitrary ordering of the selected nodes, define the ordering vector where is the index of the cluster which contains .
For a given , the ordering vector corresponding to an arbitrary ordering of nodes is illustrated in Fig. 22. In this figure (and the following figures in this paper), the number written inside each node means that the node is . From the definition, an ordering vector has components with value , for all . The set of possible ordering vectors can be specified as
| (A.17) |
Note that for given selected nodes, there exist different ordering methods. However, the values in (A.11) are determined only by the corresponding ordering vector , by reasoning similar to that used for compressing selection methods into selection vectors. Therefore, comparing the values of all possible ordering vectors is enough; it is not necessary to compare the values of all ordering methods.
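Similarly, an ordering vector discards everything about an ordering except the cluster index of each node. A sketch consistent with Definition 2, using hypothetical (cluster, position) pairs for the selected nodes:

```python
def ordering_vector(ordered_nodes):
    """Ordering vector (Definition 2): the i-th component is the
    cluster index of the i-th node in the chosen order. Nodes are
    given as hypothetical (cluster_index, position_in_cluster) pairs."""
    return [cluster for cluster, _pos in ordered_nodes]

# one possible ordering of five selected nodes
pi = ordering_vector([(1, 0), (2, 1), (1, 1), (3, 0), (2, 0)])
assert pi == [1, 2, 1, 3, 2]
# only cluster membership survives: cluster j appears s_j times, so
# the sorted multiset of counts recovers the selection vector (2, 2, 1)
assert sorted((pi.count(j) for j in set(pi)), reverse=True) == [2, 2, 1]
```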
Thus, finding the optimal sequence is identical to specifying the optimal () pair, which is obtained as follows. Recall that from the definition of in Definition 2, holds if and only if and are in the same cluster. Therefore, in (A.16) can be expressed by using notation:
| (A.18) |
Thus, the term in (A.15) can be expressed as
| (A.19) |
Combining (2), (A.15) and (A.19), we have
where
| (A.20) | ||||
| (A.21) | ||||
| (A.22) |
Therefore, the rest of the proof in part II-2 shows that
| (A.23) |
holds. We begin by stating a property of seen in (A.21).
Proposition 3.
Consider a fixed selection vector . We claim that is constant irrespective of the ordering vector .
Proof.
Let . For an arbitrary ordering vector , let where is as given in (A.22). For simplicity, we denote , and as , and , respectively. Then,
| (A.24) |
for fixed . Note that
| (A.25) |
from (A.22). Also, from the definition of in (A.17), an arbitrary ordering vector has components with value , for all . If we define
| (A.26) |
then holds for . Then,
Therefore,
for fixed . Combining with (A.25),
| (A.27) |
for fixed . From (A.24) and (A.27) we get
Therefore, from (A.21),
for every ordering vector , if and are fixed. ∎
Now, we define a special ordering vector called vertical ordering vector, which is shown to be the optimal ordering vector which minimizes for an arbitrary selection vector .
Definition 3.
For a given selection vector , the corresponding vertical ordering vector , or simply denoted as , is defined as the output of Algorithm 1.
The vertical ordering vector is illustrated in Fig. 23, for a given selection vector as an example. For , Algorithm 1 produces the corresponding vertical ordering vector . Note that the order of output nodes is illustrated in Fig. 23, as the numbers inside each node. Although the vertical ordering vector depends on the selection vector , we use simplified notation instead of . From Fig. 23, obtaining using Algorithm 1 can be analyzed as follows. Moving from the leftmost column to the rightmost column, the algorithm selects one node per cluster. After selecting all nodes, stores the index of the cluster which contains the selected node. Now, the following Lemma shows that the vertical ordering vector is optimal in the sense of minimizing for an arbitrary selection vector .
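Since Algorithm 1 itself is not reproduced above, the following is a sketch of vertical ordering based on the column-by-column description: sweep the columns of the selection-vector grid from left to right, taking one node from every cluster that still has a node in the current column. Names and indexing are illustrative assumptions:

```python
def vertical_ordering(s):
    """Vertical ordering vector for a selection vector s (Definition 3):
    sweep columns left to right, taking one node from every cluster
    that still has a node in the current column, and record the
    cluster index of each node taken."""
    order = []
    for col in range(1, max(s) + 1):
        for cluster, count in enumerate(s, start=1):
            if count >= col:
                order.append(cluster)
    return order

# selection vector (3, 2, 1): column 1 visits clusters 1,2,3,
# column 2 visits clusters 1,2, and column 3 visits cluster 1
assert vertical_ordering([3, 2, 1]) == [1, 2, 3, 1, 2, 1]
```

Because each column touches every still-nonempty cluster once, consecutive nodes in the output tend to lie in different clusters, which is the property Lemma 3 exploits.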
Lemma 3.
Let be an arbitrary selection vector. Then, a vertical ordering vector minimizes . In other words, holds for arbitrary .
Proof.
In the case of , we show that is constant for every . From (A.20),
| (A.28) |
holds for . Using in (A.26), note that can be partitioned into disjoint subsets as Therefore, (A.28) can be written as
Recall that contains components with value for . Thus, from (A.22),
for . Therefore, is constant for arbitrary . In conclusion, in (A.28) is constant irrespective of for .
The rest of the proof deals with the case. For a given arbitrary , define two subsets of as
| (A.29) | |||
| (A.30) |
where is the running sum of . Here, we call the running sum maximizer and the min-cut minimizer. Now the proof proceeds in two steps. The first step proves that the running sum maximizer minimizes min-cut, i.e., . The second step proves that the vertical ordering vector is a running sum maximizer, i.e., .
Step 1. Prove :
Define two index sets for a given ordering vector :
| (A.31) |
Now define a set of ordering vectors as
| (A.32) |
The rest of the proof is divided into 2 sub-steps.
Step 1-1. Prove and by transposition:
Consider arbitrary . Use a short-hand notation to represent for . From (A.32), there exists such that and . Therefore, there exists such that and hold. By (A),
| (A.33) |
for some . Note that from (A.21), implies . Therefore,
| (A.34) |
should hold to satisfy (A.33). Define an ordering vector as
| (A.35) |
Use a short-hand notation to represent for . Note that satisfies
| (A.36) |
for the following reason. First, use simplified notations and to mean and , respectively. Then, using (A.22), (A.34) and (A.35), we have
| (A.37) |
Similarly,
| (A.38) |
Therefore, from (A.11), (A) and (A), we have
Similarly, holds. This proves (A.36). Thus, from (A.33) and (A.36),
Therefore, holds for . In other words, if , then . This proves that holds for .
Similarly, can be proved as follows. For the pre-defined ordering vectors and , we have and . Using (A.33), we have , so that cannot be a running-sum maximizer. Therefore, holds.
Step 1-2. Prove that , :
Consider arbitrary and . For , let and be short-hand notations for and , respectively. Note that from Proposition 3,
| (A.39) |
Let
| (A.40) |
Then, from (A.29),
| (A.41) |
Combining with (A.39), we obtain
| (A.42) |
Note that from the result of Step 1-1, both and are in . Therefore, for . Similarly, for . Therefore, (A.20) can be expressed as
| (A.43) | ||||
| (A.44) |
If , then we have
from (A.42). If , we get
From (A), holds for . Therefore,
from (A.42). In the case of , define and . From (A.43),
Similarly, from (A.44),
where the last inequality is from (A). Combined with (A.39) and (A.41), we obtain
In summary, irrespective of the values, which completes the proof for Step 1-2. From the results of Step 1-1 and Step 1-2, the relationship between the sets can be depicted as in Fig. 24.
Consider and . Then, from the result of Step 1-2. Based on the definition of in (A.30), we can write for every In other words, holds for arbitrary . Therefore, holds.
Step 2. Prove :
For a given selection vector , consider an arbitrary ordering vector . The corresponding defined in (A.21) is written as
| (A.45) |
where .
Consider a set of lines , where line represents an equation: . Since we assume , these lines can be illustrated as in Fig. 25. For a given , consider marking a point for . Note that the point is on line if and only if
| (A.46) |
where the summation term in (A.46) represents the number of occurrences of the value in . For the example in Fig. 23, when and , line contains the point since .
Recall that , as defined in (A.26), where holds . For , consider with . Let be the smallest element in . Then, and hold. Thus, the point is on line . Similarly, we can find
| (A.47) |
points on line , irrespective of the ordering vector . Note that
| (A.48) |
which confirms that Fig. 25 contains points. Moreover,
| (A.49) |
holds from the definition in (A.47).
In order to maximize the running sum for every , the optimal ordering vector packs points in the leftmost area (), packs points in the leftmost remaining area (), and so on. This packing method corresponds to Fig. 26.
Note that from the definition of in (A.47) and Fig. 23, vertical ordering in Definition 3 first chooses points on line , then chooses points on line , and so on. Thus, achieves optimal packing in Fig. 26, which maximizes the running sum . Therefore, vertical ordering is a running sum maximizer, i.e., . Combining Steps 1 and 2, we conclude that minimizes among for arbitrary . ∎
Now, we define a special selection vector called the horizontal selection vector, which is shown to be the optimal selection vector which minimizes .
Definition 4.
The horizontal selection vector is defined as:
The graphical illustration of the horizontal selection vector is on the left side of Fig. 27, in the case of . The following Lemma states that the horizontal selection vector minimizes .
Lemma 4.
Consider applying the vertical ordering vector . Then, the horizontal selection vector minimizes the lower bound on the min-cut. In other words, .
Proof.
From the proof of Lemma 3, the optimal ordering vector turns out to be the vertical ordering vector, where the corresponding sequence is illustrated in Fig. 26. Depending on the selection vector , the number of points on each line changes.
Consider an arbitrary selection vector . Define a point vector where is the number of points on , as defined in (A.47). Similarly, define . Using Definition 4 and (A.47), we have
| (A.50) |
Now, we prove
| (A.51) |
The proof is divided into two steps: base case and inductive step.
Base Case: We wish to prove that . Suppose (i.e., ). Then,
| (A.52) |
where the first inequality is from (A.49). Note that if , then
| (A.53) |
Otherwise,
| (A.54) |
Therefore, combining (A.52), (A.53) and (A.54) results in , which contradicts (A.48). Therefore, holds.
Inductive Step: Assume that for arbitrary . Now we prove that holds. Suppose not. Then,
| (A.55) |
holds where
| (A.56) |
Using (A.49) and (A.55), we have
| (A.57) |
where equality holds for the last inequality iff . Using analysis similar to (A.53) and (A.54) for the base case, we can find that Combining with (A.57), we get
| (A.58) |
Equations (A.56) and (A.58) imply which contradicts (A.48). Therefore, (A.51) holds.
Now define
| (A.59) | ||||
| (A.60) |
for . Then,
| (A.61) |
holds directly from (A.51). Note that since in (A.50) is identical to in (7), can be written as
| (A.62) |
Consider , the vertical ordering vector for a given selection vector . Recall that as in Fig. 26, vertical ordering packs the leftmost points on line , packs the next points on line , and so on. Using (A.45), we can write
Therefore, using (A.59), we further write
| (A.63) |
Similarly,
| (A.64) |
Combining (A.61), (A.63) and (A.64), we have
since we assume . Therefore, combining with (A.20), we conclude that for arbitrary , which completes the proof of Lemma 4. ∎
All that remains is to compute and check that it is identical to (A.1).
Appendix B Proof of Theorem 3
Proposition 4.
| (B.1) | ||||
| (B.2) |
Proof.
First, consider the case. From (20), data can be reliably stored with node storage if the repair bandwidth satisfies , where
where the last equality is from (23) and (B.2). Thus, is achievable with finite , when .
Second, we prove that it is impossible to achieve for , in order to reliably store file . Recall that the minimum storage for is
| (B.3) |
from (28). From (22) and (B.2), we have for . Therefore,
holds, which results in
| (B.4) |
Thus, the case has the minimum storage greater than , which completes the proof of Theorem 3.
Appendix C Proof of Theorem 4
First, we prove (30). To begin with, we obtain the expression of , for . From (28), we obtain
| (C.1) | ||||
We further simplify the expression for as follows. Recall
| (C.2) |
for from (25), when holds. Note that we have
| (C.3) |
for the following reason. First, from (26) and (C.2), holds for
where the last equality is from (9) and (7). Similarly, we can prove that holds for . From (C.3) and (22), we obtain
| (C.4) |
when . Combining (C.1), (C.3) and (C.4), we have
| (C.5) |
Then, using and ,
Thus, for arbitrary fixed and ,
Therefore, is asymptotically equivalent to .
Second, we prove (31). Note that from (29), holds for arbitrary . Therefore, all we need to prove is
To begin with, we obtain the expression for , when . For , in (25) is
| (C.6) |
for . Moreover, from (23),
| (C.7) |
for . Therefore, from (29), (C.6) and (C.7),
| (C.8) |
Now we focus on the case of . First, let and be
| (C.9) | ||||
| (C.10) |
which represent the quotient and remainder of . Note that
| (C.11) |
Then, we have
| (C.12) |
where the last equality is from (C.11). From (C.2) and (A.66), we have
| (C.13) |
where the second last equality is from (9) and (C). Moreover, using (C.11), we have
| (C.14) |
where the equality holds if and only if . Furthermore,
| (C.15) |
for from (23). Combining (29), (C), (C) and (C.15) result in
| (C.16) |
Appendix D Proof of Theorem 5
Recall the definition of an -LRC which appears right before Theorem 5. Moreover, recall that the repair locality of a code is defined as the number of nodes to be contacted in the node repair process [32]. Since each cluster contains nodes, every node in a DSS with has the repair locality of
| (D.1) |
Moreover, note that for any code with minimum distance , the original file can be retrieved by contacting coded symbols [38]. Since the present paper considers DSSs such that contacting any nodes can retrieve the original file , we have the minimum distance of
| (D.2) |
Thus, the intra-cluster repairable code defined in Section IV-C is a -LRC.
Now we show that (34) holds. Note that from Fig. 10, we obtain for , where
| (D.3) |
holds according to (C.1). Thus, (34) is proven by showing
| (D.4) |
By plugging (D.1), (D.2) and (D.3) into (D.4), we have
Therefore, all we need to prove is
| (D.5) |
which is proved as follows. Using and defined in (C.9) and (C.10), the right-hand-side (RHS) of (D.5) is
Thus, (D.5) holds, where the equality condition is , or equivalently . Therefore, (34) holds, where the equality condition is and . This completes the proof of Theorem 5.
Appendix E Proof of Theorem 6
According to (A.20) and (A.68) in Appendix B, the capacity can be expressed as
| (E.1) |
Using (A.64) and (A.66), in (E.1), or simply has the following property:
| (E.2) |
where Note that from (7). Therefore, in (37) can be expressed as
| (E.3) |
where the last inequality is from (3). Combining (26) and (E.3) results in
| (E.4) |
From (A.64), (E.3) and (E.4), we have
| (E.5) |
where the last equality holds due to the assumption of in the setting of Theorem 6. Since is a decreasing sequence from (E.2), the result of (E) implies that
| (E.6) |
Thus, the capacity expression in (E.1) can be expressed as
| (E.7) |
Note that from (A.64) and (A.66), we have for . Therefore, in (E.7) is
| (E.8) |
which is illustrated as a piecewise linear function of in Fig. 28. Based on (E.8), the sequence in this figure has the following expression:
| (E.9) |
Appendix F Proofs of Corollaries
F-A Proof of Corollary 1
From the proof of Theorem 1, the capacity expression is equal to (A), which is
where is defined in (A.60). Using and (2), this can be rewritten as
| (F.1) |
Using and defined in (25) and (24), the capacity expression reduces to
| (F.2) |
which is a continuous function of .
Proof.
Moreover, note that holds from the definition of and in Table I. Thus, combined with , it is shown that in (2) is lower-bounded as
Here, we define
Then, the valid region of is expressed as as illustrated in Figs. 29 and 30. The rest of the proof depends on the range of values; we first consider the case, and then consider the case.
F-A1 If
Using (B.2), holds. Combining with (24), we have or equivalently, If for some , then (F.2) can be expressed as
where is defined in (23). If , then
If , then . In summary, the capacity is
| (F.3) |
which is illustrated in Fig. 29. Since is a decreasing sequence from Remark 1, we have for . Thus, defined in (23) is a monotonically decreasing, non-negative sequence. This implies that the curve in Fig. 29 is a monotonically increasing function of .
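The monotonicity claim above can be checked numerically. The following sketch uses purely hypothetical slope values in place of the sequence from (23): any piecewise-linear curve built from a decreasing, non-negative sequence of segment slopes is necessarily increasing (every slope is non-negative) and concave (slopes never grow).

```python
# Illustrative sketch with hypothetical values: a piecewise-linear curve whose
# unit-width segments have decreasing, non-negative slopes is monotonically
# increasing and concave -- the property used for the capacity curve in Fig. 29.
import itertools

slopes = [5.0, 3.0, 1.5, 0.5, 0.0]  # hypothetical decreasing, non-negative slopes

# Breakpoint values of the curve: start at 0 and accumulate each segment's rise.
values = [0.0] + list(itertools.accumulate(slopes))

# Monotonically increasing: each breakpoint value is >= the previous one.
assert all(b >= a for a, b in zip(values, values[1:]))
# Concave: successive increments (the slopes) never increase.
assert all(b <= a for a, b in zip(slopes, slopes[1:]))
print(values)  # [0.0, 5.0, 8.0, 9.5, 10.0, 10.0]
```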
F-A2 Otherwise (if )
Using (B.2),
| (F.6) |
holds. Since is a decreasing sequence from Remark 1, there exists such that holds, or equivalently,
F-B Proof of Corollary 2
First, we focus on the MSR point illustrated in Fig. 9. From (20), the MSR point for is
| (F.9) |
where the last equality is from in (23). Moreover, from (21), the MSR point for is
| (F.10) |
Equations (F.9) and (F.10) prove (28). The expression (29) for the MBR point is directly obtained from Corollary 1 and Fig. 9.
Appendix G Proofs of Propositions
G-A Proof of Proposition 1
Consider a general setting, where each newcomer node is helped by nodes in the same cluster, receiving information from each node, and nodes in other clusters, receiving information from each node. Under this setting, the coefficient of in (G.2) cannot exceed . Similarly, the coefficient of in (G.2) cannot exceed . Thus, the capacity for general is expressed as
| (G.3) |
where
| (G.4) |
Consider arbitrary fixed and . Since and are fixed in the basic setting of Proposition 1, only and are variables in (G.4), while other parameters are constants. Then, (G.4) can be expressed as
| (G.5) |
where and are constants. Note that
| (G.6) |
is a non-increasing function of . Thus, in (G.5) is a non-decreasing function of for arbitrary fixed and . Since the maximum value is , we have
| (G.7) |
for . In other words, for all , we have
Similarly, for all ,
holds. Therefore, for all ,
| (G.8) |
Let
| (G.9) |
Then, from (G.3), (G.8) and (G.9),
| (G.10) |
for all and . Therefore, choosing and maximizes storage capacity when the available resources, and , are given.
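The core of the argument above is that each cut value is a minimum of terms whose coefficients are capped by the available repair traffic, which grows with the number of helper nodes; hence every cut value, and therefore the capacity (their minimum), is non-decreasing in the helper counts and is maximized by using all available helpers. A minimal numeric sketch follows; `cut_value` and all parameter values (`alpha`, `beta_intra`, `beta_cross`, `cut_size`) are hypothetical stand-ins, not the paper's exact cut expression.

```python
# Numeric sketch of the monotonicity argument in Proposition 1 (hypothetical
# values): each node on a cut contributes the smaller of its storage and the
# repair traffic it can draw, and traffic grows with the helper counts.

def cut_value(num_intra_helpers, num_cross_helpers,
              alpha=4.0, beta_intra=1.0, beta_cross=0.5, cut_size=6):
    """Value of one cut: each of `cut_size` nodes contributes the smaller of
    its storage alpha and the total repair traffic from its helpers."""
    traffic = num_intra_helpers * beta_intra + num_cross_helpers * beta_cross
    return cut_size * min(alpha, traffic)

max_intra, max_cross = 3, 4
best = cut_value(max_intra, max_cross)
# Using fewer helpers never increases the cut value, so the maximum-helper
# choice maximizes the minimum cut (the capacity) as well.
assert all(cut_value(i, c) <= best
           for i in range(max_intra + 1) for c in range(max_cross + 1))
print(best)
```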
G-B Proof of Proposition 2
First, we prove (8). Recall and defined in (6) and (7). Consider the support set , which is defined as
| (G.11) |
Then, we have
| (G.12) |
| (G.13) |
for every . Therefore, by combining (G.13) and (6),
| (G.14) |
holds for every . Combining (2), (G.12) and (G.14) results in
| (G.15) |
for arbitrary . Since holds for , we conclude that (G.15) holds for with , .
Appendix H Proofs of Lemmas
H-A Proof of Lemma 1
Using (2), in (14) can be expressed as
| (H.1) |
According to (A.20) and (A.68) in Appendix B, the capacity can be expressed as
| (H.2) |
From (A.21), we have
| (H.3) |
for . Therefore, when , the capacity expression in (H.2) reduces to
| (H.4) |
Recall (A.64):
| (H.5) |
From the expression of in (A.66), we have
| (H.6) |
since from (9).
Using (H.5) and (H.6), the capacity expression in (H.4) is expressed as
| (H.7) |
where and
| (H.8) |
Similarly, in (H.1) can be expressed as
| (H.9) |
where and
| (H.10) |
H-B Proof of Lemma 2
Recall that the and values are all fixed. The expression for in (14) can be expressed as
| (H.13) |
Note that (H.13) is a monotonically decreasing function of . Moreover, we consider the case, as mentioned in (4). Thus, is upper- and lower-bounded by its values at and , respectively:
| (H.14) |
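The bounding step in (H.14) relies only on monotonicity: a decreasing function on an interval is sandwiched between its values at the two endpoints. A minimal check, with a hypothetical decreasing function standing in for (H.13):

```python
# Minimal check of the endpoint-bounding step (hypothetical stand-in function):
# a monotonically decreasing f on an integer range [lo, hi] satisfies
# f(hi) <= f(x) <= f(lo) for every x in the range.

def f(x):
    # Hypothetical decreasing function playing the role of (H.13).
    return 10.0 / (x + 1)

lo, hi = 1, 8
assert all(f(hi) <= f(x) <= f(lo) for x in range(lo, hi + 1))
print(f(lo), f(hi))
```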
Therefore, holds. Moreover, the expression for in (17) is
| (H.15) |
Putting into (H.15), we get
References
- [1] J. y. Sohn, B. Choi, S. W. Yoon, and J. Moon, “Capacity of clustered distributed storage,” in 2017 IEEE International Conference on Communications (ICC), May 2017, pp. 1–7.
- [2] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29–43.
- [3] R. Bhagwan, K. Tati, Y. Cheng, S. Savage, and G. M. Voelker, “Total recall: System support for automated availability management.” in NSDI, vol. 4, 2004, pp. 25–25.
- [4] F. Dabek, J. Li, E. Sit, J. Robertson, M. F. Kaashoek, and R. Morris, “Designing a dht for low latency and high throughput.” in NSDI, vol. 4, 2004, pp. 85–98.
- [5] S. C. Rhea, P. R. Eaton, D. Geels, H. Weatherspoon, B. Y. Zhao, and J. Kubiatowicz, “Pond: The oceanstore prototype.” in FAST, vol. 3, 2003, pp. 1–14.
- [6] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2010, pp. 1–10.
- [7] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, “Erasure coding in windows azure storage,” in Presented as part of the 2012 USENIX Annual Technical Conference (USENIX ATC 12), 2012, pp. 15–26.
- [8] S. Muralidhar, W. Lloyd, S. Roy, C. Hill, E. Lin, W. Liu, S. Pan, S. Shankar, V. Sivakumar, L. Tang et al., “f4: Facebook’s warm blob storage system,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014, pp. 383–398.
- [9] D. Borthakur, R. Schmidt, R. Vadali, S. Chen, and P. Kling, “Hdfs raid,” in Hadoop User Group Meeting, 2010.
- [10] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, “Network coding for distributed storage systems,” IEEE Transactions on Information Theory, vol. 56, no. 9, pp. 4539–4551, 2010.
- [11] R. Ahlswede, N. Cai, S.-Y. Li, and R. W. Yeung, “Network information flow,” IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1204–1216, 2000.
- [12] K. Rashmi, N. B. Shah, P. V. Kumar, and K. Ramchandran, “Explicit construction of optimal exact regenerating codes for distributed storage,” in 2009 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2009, pp. 1243–1249.
- [13] K. V. Rashmi, N. B. Shah, and P. V. Kumar, “Optimal exact-regenerating codes for distributed storage at the msr and mbr points via a product-matrix construction,” IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5227–5239, 2011.
- [14] N. B. Shah, K. Rashmi, P. V. Kumar, and K. Ramchandran, “Interference alignment in regenerating codes for distributed storage: Necessity and code constructions,” IEEE Transactions on Information Theory, vol. 58, no. 4, pp. 2134–2158, 2012.
- [15] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan, “Availability in globally distributed storage systems.” in OSDI, 2010, pp. 61–74.
- [16] K. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, “A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the facebook warehouse cluster.” in HotStorage, 2013.
- [17] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. Vijaykumar, “Shufflewatcher: Shuffle-aware scheduling in multi-tenant mapreduce clusters,” in 2014 USENIX Annual Technical Conference (USENIX ATC 14), 2014, pp. 1–13.
- [18] T. Benson, A. Akella, and D. A. Maltz, “Network traffic characteristics of data centers in the wild,” in Proceedings of the 10th ACM SIGCOMM conference on Internet measurement. ACM, 2010, pp. 267–280.
- [19] A. Vahdat, M. Al-Fares, N. Farrington, R. N. Mysore, G. Porter, and S. Radhakrishnan, “Scale-out networking in the data center,” IEEE Micro, vol. 30, no. 4, pp. 29–41, 2010.
- [20] T. Ernvall, S. El Rouayheb, C. Hollanti, and H. V. Poor, “Capacity and security of heterogeneous distributed storage systems,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2701–2709, 2013.
- [21] Q. Yu, K. W. Shum, and C. W. Sung, “Tradeoff between storage cost and repair cost in heterogeneous distributed storage systems,” Transactions on Emerging Telecommunications Technologies, vol. 26, no. 10, pp. 1201–1211, 2015.
- [22] S. Akhlaghi, A. Kiani, and M. R. Ghanavati, “A fundamental trade-off between the download cost and repair bandwidth in distributed storage systems,” in 2010 IEEE International Symposium on Network Coding (NetCod). IEEE, 2010, pp. 1–6.
- [23] N. B. Shah, K. Rashmi, and P. V. Kumar, “A flexible class of regenerating codes for distributed storage,” in 2010 IEEE International Symposium on Information Theory (ISIT). IEEE, 2010, pp. 1943–1947.
- [24] B. Gastón, J. Pujol, and M. Villanueva, “A realistic distributed storage system that minimizes data storage and repair bandwidth,” arXiv preprint arXiv:1301.1549, 2013.
- [25] N. Prakash, V. Abdrashitov, and M. Médard, “A generalization of regenerating codes for clustered storage systems,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016.
- [26] ——, “The storage vs repair-bandwidth trade-off for clustered storage systems,” arXiv preprint arXiv:1701.04909, 2017.
- [27] B. Choi, J.-y. Sohn, S. W. Yoon, and J. Moon, “Secure clustered distributed storage against eavesdroppers,” arXiv preprint arXiv:1702.07498, 2017.
- [28] Y. Hu, X. Li, M. Zhang, P. P. Lee, X. Zhang, P. Zhou, and D. Feng, “Optimal repair layering for erasure-coded data centers: From theory to practice,” ACM Transactions on Storage (TOS), vol. 13, no. 4, p. 33, 2017.
- [29] G. Calis and O. O. Koyluoglu, “Architecture-aware coding for distributed storage: Repairable block failure resilient codes,” arXiv preprint arXiv:1605.04989, 2016.
- [30] M. Ye and A. Barg, “Explicit constructions of optimal-access mds codes with nearly optimal sub-packetization,” IEEE Transactions on Information Theory, vol. 63, no. 10, pp. 6307–6317, 2017.
- [31] I. Tamo, D. S. Papailiopoulos, and A. G. Dimakis, “Optimal locally repairable codes and connections to matroid theory,” IEEE Transactions on Information Theory, vol. 62, no. 12, pp. 6661–6671, 2016.
- [32] D. S. Papailiopoulos and A. G. Dimakis, “Locally repairable codes,” IEEE Transactions on Information Theory, vol. 60, no. 10, pp. 5843–5855, 2014.
- [33] A. G. Dimakis, K. Ramchandran, Y. Wu, and C. Suh, “A survey on network codes for distributed storage,” Proceedings of the IEEE, vol. 99, no. 3, pp. 476–489, 2011.
- [34] J. Bang-Jensen and G. Z. Gutin, Digraphs: theory, algorithms and applications. Springer Science & Business Media, 2008.
- [35] A. Erdélyi, Asymptotic Expansions, ser. Dover Books on Mathematics. Dover Publications, 1956. [Online]. Available: https://books.google.co.kr/books?id=aedk-OHdmNYC
- [36] J.-y. Sohn and J. Moon, “Explicit construction of mbr codes for clustered distributed storage,” arXiv preprint arXiv:1801.02287, 2018.
- [37] J.-y. Sohn, B. Choi, and J. Moon, “A class of msr codes for clustered distributed storage,” arXiv preprint arXiv:1801.02014, 2018.
- [38] T. K. Moon, Error Correction Coding: Mathematical Methods and Algorithms. John Wiley & Sons, 2005.
| Jy-yong Sohn (S’15) received the B.S. and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2014 and 2016, respectively. He is currently pursuing the Ph.D. degree at KAIST. His research interests include coding for distributed storage and distributed computing, massive MIMO effects on wireless multi-cellular systems, and 5G communications. He is a recipient of the IEEE International Conference on Communications (ICC) Best Paper Award in 2017. |
| Beongjun Choi (S’17) received the B.S. and M.S. degrees in mathematics and electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2014 and 2017, respectively. He is currently pursuing the Ph.D. degree in electrical engineering at KAIST. His research interests include error-correcting codes, distributed storage systems, and information theory. He is a co-recipient of the IEEE International Conference on Communications (ICC) Best Paper Award in 2017. |
| Sung Whan Yoon (M’17) received the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2013 and 2017, respectively. He has been a postdoctoral researcher at KAIST since 2017. His research interests are in the areas of coding theory, distributed systems, and artificial intelligence, with a focus on polar codes, distributed storage systems, and meta-learning algorithms for neural networks. In the area of artificial intelligence, his primary interests include information-theoretic analysis and algorithmic development of meta-learning. He was a co-recipient of the IEEE International Conference on Communications (ICC) Best Paper Award in 2017. |
| Jaekyun Moon (F’05) received the Ph.D. degree in electrical and computer engineering from Carnegie Mellon University, Pittsburgh, PA, USA. He is currently a Professor of electrical engineering at KAIST. From 1990 through early 2009, he was with the faculty of the Department of Electrical and Computer Engineering at the University of Minnesota, Twin Cities. He consulted as Chief Scientist for DSPG, Inc. from 2004 to 2007. He also worked as Chief Technology Officer at Link-A-Media Devices Corporation. His research interests are in the areas of channel characterization, signal processing, and coding for data storage and digital communication. Prof. Moon received the McKnight Land-Grant Professorship from the University of Minnesota. He received IBM Faculty Development Awards as well as IBM Partnership Awards. He was awarded the National Storage Industry Consortium (NSIC) Technical Achievement Award for the invention of the maximum transition run (MTR) code, a widely used error-control/modulation code in commercial storage systems. He served as Program Chair for the 1997 IEEE Magnetic Recording Conference. He is also Past Chair of the Signal Processing for Storage Technical Committee of the IEEE Communications Society. He served as a guest editor for the 2001 IEEE JSAC issue on Signal Processing for High Density Recording, and as an Editor for IEEE TRANSACTIONS ON MAGNETICS in the area of signal processing and coding from 2001 to 2006. He is an IEEE Fellow. |